A Cross Training Corrective Approach for Web Pages Classification
Abstract
Textual document classification is a challenging area of data mining, and web page classification is one of its forms. However, the text of a web page is not homogeneous, since a single page can discuss related but different subjects. As a result, a textual classifier performs worse on web pages than on ordinary textual documents, and a technique is needed to correct its results. One category of techniques addresses this problem by using the hidden underlying information in the test set to correct the classes assigned by a textual classifier. In this paper, we propose a method that belongs to this category. Our method is a Cross Training based Corrective approach (CTC) for web page classification that learns information from the test set in order to fix the classes initially assigned by a text classifier on that test set. This adjustment leads to a significant improvement in classification results. We tested our approach using three traditional classification algorithms: Support Vector Machine (SVM), Naïve Bayes (NB) and K Nearest Neighbors (KNN), on four subsets of the Open Directory Project (ODP). Results show that our collective and corrective approach, when applied after SVM, NB or KNN, enhances their classification results by up to 12.39%.
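The abstract does not spell out how CTC works internally, so the following is only a rough sketch of one way a cross-training correction over the test set could look, not the authors' algorithm. It assumes the test set is split into folds, that each fold is relabeled by a model trained on the other folds using the labels a base classifier initially predicted, and that a nearest-centroid model stands in for the corrector. The function names (`centroids`, `predict`, `cross_training_correct`), the fold scheme, and the toy data are all hypothetical.

```python
from collections import defaultdict

def centroids(X, y):
    """Return the mean feature vector for each label."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}

def predict(cents, x):
    """Return the label of the nearest centroid (squared Euclidean distance)."""
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(cents[lab], x)))

def cross_training_correct(X_test, initial_labels, n_folds=2):
    """Relabel each test fold with a model trained on the other folds.

    Assumption: the training labels are the ones a base classifier
    (e.g. SVM, NB or KNN) initially assigned to the test set.
    """
    corrected = list(initial_labels)
    for k in range(n_folds):
        held = [i for i in range(len(X_test)) if i % n_folds == k]
        rest = [i for i in range(len(X_test)) if i % n_folds != k]
        cents = centroids([X_test[i] for i in rest],
                          [initial_labels[i] for i in rest])
        for i in held:
            corrected[i] = predict(cents, X_test[i])
    return corrected

# Toy test set: two well-separated groups of pages (made-up feature vectors).
X_test = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
          [1.0, 0.9], [0.9, 1.0], [1.1, 1.0], [1.0, 1.1]]
# Labels from a hypothetical base classifier; the fourth page is mislabeled.
initial = ["a", "a", "a", "b", "b", "b", "b", "b"]
corrected = cross_training_correct(X_test, initial)
print(corrected)
```

In this toy case the mislabeled fourth page lies close to the other "a" pages, so the model trained on the remaining fold reassigns it. On real web pages the benefit would depend on how accurate the initial labels are and on the corrector chosen.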
Similar resources
Iterative cross-training: An algorithm for learning from unlabeled Web pages
The paper presents a learning method, called Iterative Cross-Training (ICT), for classifying Web pages in two classification problems: (1) classification of Thai/non-Thai Web pages, and (2) classification of course/non-course home pages. Given domain knowledge or a small set of labeled data, our method combines two classifiers that are able to effectively use unlabeled examples to iterat...
A Comparative Study of Web-pages Classification Methods using Fuzzy Operators Applied to Arabic Web-pages
In this study, a fuzzy similarity approach for Arabic web pages classification is presented. The approach uses a fuzzy term-category relation by manipulating membership degree for the training data and the degree value for a test web page. Six measures are used and compared in this study. These measures include: Einstein, Algebraic, Hamacher, MinMax, Special case fuzzy and Bounded Difference ap...
Efficient Prediction of Cross-Site Scripting Web Pages using Extreme Learning Machine
Malicious code is a way of attempting to acquire sensitive information by sending malicious code to the trustworthy entity in an electronic communication. JavaScript is the most frequently used command language in the web page environment. If the hackers misuse the JavaScript code there is a possibility of stealing the authentication and confidential information about an organization and user. ...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...
Discovering Test Set Regularities in Relational Domains
Machine learning typically involves discovering regularities in a training set, then applying these learned regularities to classify objects in a test set. In this paper we present an approach to discovering additional regularities in the test set, and show that in relational domains such test set regularities can be used to improve classification accuracy beyond that achieved using the trainin...
Journal title:
- IJCSA
Volume 12, Issue
Pages -
Publication date: 2015